PARTIAL ESTIMATION OF COVARIANCE MATRICES 



ELIZAVETA LEVINA AND ROMAN VERSHYNIN 

Abstract. A classical approach to accurately estimating the covariance
matrix $\Sigma$ of a $p$-variate normal distribution is to draw a sample of
size $n > p$ and form a sample covariance matrix. However, many modern
applications operate with much smaller sample sizes, thus calling for
estimation guarantees in the regime $n \ll p$. We show that a sample of size
$n = O(m \log^6 p)$ is sufficient to accurately estimate in operator norm an
arbitrary symmetric part of $\Sigma$ consisting of $m \le n$ nonzero entries
per row. This follows from a general result on estimating Hadamard products
$M \cdot \Sigma$, where $M$ is an arbitrary symmetric matrix.



1. Introduction 

The problem of estimating the population covariance matrix from a sample
of $n$ i.i.d. observations $X_1, \ldots, X_n$ in $\mathbb{R}^p$ has attracted
a lot of recent interest in statistics, with the focus on small sample sizes,
$n \ll p$. Estimation of covariance matrices plays a key role in many data
analysis techniques (e.g. in principal component analysis, discriminant
analysis, graphical models). Applications where $n \ll p$ have become very
common in gene expression data, climate studies, spectroscopy, and so on.

Suppose the $X_k$ are drawn from the multivariate normal distribution
$N(0, \Sigma)$, where $\Sigma = \mathbb{E} X X^T$ is a symmetric
positive-semidefinite $p \times p$ matrix. Thus $\Sigma$ is the covariance
matrix of that distribution. The usual estimator for $\Sigma$ is the sample
covariance matrix defined by

$$\hat\Sigma_n = \frac{1}{n} \sum_{k=1}^{n} X_k X_k^T. \tag{1.1}$$


The properties of $\hat\Sigma_n$ have been extensively studied [14, 11]. Let
us first consider the simpler case where $\Sigma = I$. The "Bai-Yin law" of
random matrix theory determines the magnitude of the extreme eigenvalues of
the Wishart random matrix $\hat\Sigma_n$ in the limit as $n, p \to \infty$
and $n/p \to \mathrm{const}$. It says that the spectrum of $\hat\Sigma_n$ is
almost surely contained in the interval
$[a^2/n + o(1), \; b^2/n + o(1)]$, where $a = (\sqrt{n} - \sqrt{p})_+$ and
$b = \sqrt{n} + \sqrt{p}$. (The upper bound is due to Geman [10], the lower
bound is due to Silverstein [17]; see [1] for a unified exposition.) It
follows that, with high probability,

$$\|\hat\Sigma_n - I\| \le 2\sqrt{\frac{p}{n}} + \frac{p}{n} + o(1)$$

where $\|\cdot\|$ is the spectral norm, also known as the operator or
$\ell_2$ matrix norm.

A similar approximation holds for random vectors drawn from a general
multivariate normal distribution $N(0, \Sigma)$ (where $\Sigma$ depends on
$p$). Indeed, arguing as above for the random vectors $\Sigma^{-1/2} X_k$,
whose covariance matrix is the identity, we easily conclude that with high
probability,

$$\|\hat\Sigma_n - \Sigma\| \le \Big( 2\sqrt{\frac{p}{n}} + \frac{p}{n} + o(1) \Big) \|\Sigma\|. \tag{1.2}$$

This implies that for an arbitrary precision $\varepsilon \in (0, 1)$ and for
$p$ sufficiently large, the sample size

$$n \gtrsim \varepsilon^{-2} p \quad \text{suffices for} \quad \|\hat\Sigma_n - \Sigma\| \le \varepsilon \|\Sigma\|. \tag{1.3}$$

More accurately, the following non-asymptotic result holds for arbitrary
$p \in \mathbb{N}$, $\varepsilon \in (0, 1)$ and $t \ge 1$. If the sample size
satisfies $n \ge C (t/\varepsilon)^2 p$, then the inequality in (1.3) holds
with probability at least $1 - 2\exp(-t^2 n)$ [18, Remark 51]. Here $C$ is an
absolute constant. This provides a satisfactory answer to the covariance
estimation problem in the regime $n \ge p$ for general $p$-variate normal
distributions.

We will now focus on the more difficult regime $n < p$, where covariance
estimation is impossible in general (indeed, if $\Sigma = I$ then clearly
$\|\hat\Sigma_n - I\| \ge 1$ because of the rank deficiency of
$\hat\Sigma_n$). Still, under natural structural assumptions on the
covariance matrix $\Sigma$, a number of alternative estimation schemes have
been proposed.

A common assumption is that most entries of $\Sigma$ are zero or close to 0,
and thus can be safely estimated as 0, which reduces the effective size of
the problem. This assumption underlies banding or tapering the covariance
matrix in the case of ordered variables [1, [o], [l^, [sj , and thresholding
of the covariance matrix in the case of unordered variables 0, 0, [H, H]. In
both cases, the regularized estimators have the form $M \cdot \hat\Sigma_n$,
where $M$ is a regularizing "mask" matrix, and "$\cdot$" denotes the Hadamard
(entrywise) product of matrices. For banding and hard thresholding, $M$ is a
sparse matrix containing only 0s and 1s; for tapering, its entries can be
anywhere in the interval $[0, 1]$, depending on the taper used. For
thresholding, the mask $M$ has to be estimated from the data, whereas for
banding and tapering it is determined by a tuning parameter.
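As an illustration of regularized estimators of the form $M \cdot \hat\Sigma_n$, the following Python sketch builds a banding mask and a tapering mask and applies each to a sample covariance matrix through the entrywise product. It is only a sketch under stated assumptions: NumPy is assumed to be available, the helper names are hypothetical, and the linear taper is one arbitrary choice of weights in $[0, 1]$, not the taper of any particular reference.

```python
import numpy as np

def banding_mask(p, k):
    """0/1 mask keeping the entries with |i - j| <= k (hypothetical helper)."""
    idx = np.arange(p)
    return (np.abs(idx[:, None] - idx[None, :]) <= k).astype(float)

def linear_taper_mask(p, k):
    """Weights equal to 1 on the diagonal, decaying linearly to 0 at |i - j| = k.
    This particular taper is an assumption made for illustration only."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.clip(1.0 - dist / k, 0.0, 1.0)

def sample_covariance(X):
    """(1/n) * sum_k X_k X_k^T for rows X_k with known zero mean, as in (1.1)."""
    return X.T @ X / X.shape[0]

rng = np.random.default_rng(0)
n, p, k = 50, 200, 3                      # n much smaller than p
X = rng.standard_normal((n, p))           # rows drawn from N(0, I)
S_hat = sample_covariance(X)

banded = banding_mask(p, k) * S_hat       # "*" is the Hadamard product
tapered = linear_taper_mask(p, k) * S_hat
```

In NumPy the operator `*` on two arrays of the same shape is already the entrywise product, so no special routine is needed for $M \cdot \hat\Sigma_n$.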
The reason regularized estimators work is best understood from the following
decomposition:

$$\|M \cdot \hat\Sigma_n - \Sigma\| \le \|M \cdot \hat\Sigma_n - M \cdot \Sigma\| + \|M \cdot \Sigma - \Sigma\|.$$

The first term (variance) is well-behaved because $M$ is sparse or close to
sparse, and the second term (bias) is well-behaved because of the assumptions
made on the true $\Sigma$. The overall rate for covariance estimation is
normally computed as the sum of these two terms, but they come from two quite
different problems: the first term has to do with how much of the covariance
matrix we can reliably estimate with the corresponding part of the sample
covariance matrix; and the second term has to do with our model assumptions.
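The two terms of this decomposition can be computed separately in a small simulation. The sketch below assumes NumPy, a covariance of the form $\Sigma_{ij} = \rho^{|i-j|}$ and a banding mask; these choices, and all variable names, are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, rho = 40, 100, 5, 0.5

idx = np.arange(p)
dist = np.abs(idx[:, None] - idx[None, :])
Sigma = rho ** dist                        # assumed population covariance
M = (dist <= k).astype(float)              # banding mask with half-width k

# Draw n rows from N(0, Sigma) using a Cholesky factor of Sigma.
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T
S_hat = X.T @ X / n                        # sample covariance (known zero mean)

def op_norm(A):
    """Spectral (l_2 -> l_2 operator) norm."""
    return float(np.linalg.norm(A, 2))

total = op_norm(M * S_hat - Sigma)         # error of the regularized estimator
variance = op_norm(M * S_hat - M * Sigma)  # first term: partial estimation error
bias = op_norm(M * Sigma - Sigma)          # second term: depends only on the model
```

Only the variance term involves the sample; the bias term is a deterministic function of $M$ and $\Sigma$.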

In this paper, we focus on the first question: how large does the sample size
$n$ have to be so that $M \cdot \hat\Sigma_n$ is a good approximation to
$M \cdot \Sigma$? Of course, the answer will depend on the size of the part we
would like to estimate, but it almost does not depend on the ambient
dimension $p$, as we will show.

As a simple example, consider the case where $M = (m_{ij})$ is a fixed minor,
i.e. $m_{ij} = 1_{\{i, j \in S\}}$ for some a priori given index set
$S \subset \{1, \ldots, p\}$. Denoting $|S| = m$, we can easily deduce from
the results above that

$$\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| \le \Big( 2\sqrt{\frac{m}{n}} + \frac{m}{n} + o(1) \Big) \|\Sigma\|. \tag{1.4}$$

Note that the quality of approximation does not depend on the ambient
dimension $p$ but only on the number of selected variables $m$, as one would
expect.
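The reduction behind (1.4) is that masking with a fixed minor amounts to forming the sample covariance matrix of the $m$ selected coordinates. A minimal numerical check of this identity follows, assuming NumPy, $\Sigma = I$, and an index set chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 50
S_idx = np.array([3, 7, 19, 42])           # hypothetical index set S, so m = 4
m = len(S_idx)

X = rng.standard_normal((n, p))            # rows drawn from N(0, I_p)
S_hat = X.T @ X / n                        # p x p sample covariance

M = np.zeros((p, p))
M[np.ix_(S_idx, S_idx)] = 1.0              # m_ij = 1 exactly when i and j lie in S
masked = M * S_hat

# Sample covariance of the m selected coordinates only.
X_sub = X[:, S_idx]
S_hat_sub = X_sub.T @ X_sub / n

err_full = float(np.linalg.norm(masked - M * np.eye(p), 2))   # ||M.S_hat - M.Sigma||
err_sub = float(np.linalg.norm(S_hat_sub - np.eye(m), 2))     # same error in dimension m
```

The two errors coincide, which is why the ambient dimension $p$ plays no role in this special case.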

The purpose of this paper is to reveal a general phenomenon behind (1.4).
We shall show that, up to a necessary logarithmic factor in $p$, the upper
bound in (1.4) holds for arbitrary symmetric 0/1 matrices $M$ with at most
$m$ nonzero entries per column. This result is independent of any structural
hypotheses on the locations of the nonzero entries.

Even more generally, a stable version of (1.4) holds. It applies to
completely arbitrary symmetric matrices $M$, and therefore covers all forms of
tapering previously considered in the literature. Instead of the sparsity
parameter $m$, the bound is governed by two different operator norms of $M$
that correspond to the different dependencies on $m$ in the two terms in
(1.4). These norms are the $\ell_1 \to \ell_2$ operator norm
$\|M\|_{1,2} = \max_j (\sum_i m_{ij}^2)^{1/2}$, and the usual
$\ell_2 \to \ell_2$ operator norm $\|M\|$.
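Both norms are straightforward to evaluate for a given mask. The following sketch, which assumes NumPy and uses hypothetical helper names, computes them for a 0/1 banding mask and for a tapered mask, and records the number $m$ of nonzero entries per column of the 0/1 mask.

```python
import numpy as np

def norm_1_to_2(M):
    """l_1 -> l_2 operator norm: the largest Euclidean norm of a column of M."""
    return float(np.sqrt((M ** 2).sum(axis=0)).max())

def norm_2_to_2(M):
    """Usual l_2 -> l_2 operator (spectral) norm."""
    return float(np.linalg.norm(M, 2))

p, half_width = 60, 2
idx = np.arange(p)
dist = np.abs(idx[:, None] - idx[None, :])

M_band = (dist <= half_width).astype(float)                   # symmetric 0/1 mask
m = int(M_band.sum(axis=0).max())                             # nonzeros per column
M_taper = np.clip(1.0 - dist / (half_width + 1), 0.0, 1.0)    # entries in [0, 1]

band_norms = (norm_1_to_2(M_band), norm_2_to_2(M_band))
taper_norms = (norm_1_to_2(M_taper), norm_2_to_2(M_taper))
```

For a symmetric 0/1 mask with at most $m$ nonzero entries per column, the first norm is at most $\sqrt{m}$ and the second is at most $m$, which is how the sparsity parameter re-enters the bound.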

Note that since the right hand sides of (1.3) and (1.4) depend on
$\|\Sigma\|$, these are relative rather than absolute error bounds on the
difference between $\hat\Sigma_n$ and $\Sigma$. The latter are more commonly
analyzed in the literature, even though numerical results are commonly
reported as relative errors to enable comparisons across models. The
advantage of using the relative error is disentangling the convergence rates
of the estimator from assumptions on the true covariance $\Sigma$, which is
one of our goals in this paper.



2. Main result

Throughout this paper, we consider an arbitrary mean zero normal distribution
in $\mathbb{R}^p$. Its covariance matrix is denoted by $\Sigma$, and the
sample covariance matrix (1.1) obtained from an i.i.d. sample of $n$
observations by $\hat\Sigma_n$. Positive absolute constants will be denoted
$C, C_1, C_2$.

Theorem 2.1 (Estimation of Hadamard products). Let $M$ be an arbitrary
fixed symmetric $p \times p$ matrix. Then

$$\mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| \le C \log^3(2p) \Big( \frac{\|M\|_{1,2}}{\sqrt{n}} + \frac{\|M\|}{n} \Big) \|\Sigma\|. \tag{2.1}$$

Note that $M$ does not depend on $\hat\Sigma_n$ or $\Sigma$. We will prove
this result in Section 4.

Corollary 2.2 (Partial estimation). Let $M$ be an arbitrary fixed symmetric
$p \times p$ matrix such that all of its entries are equal to 0 or 1, and
there are at most $m$ nonzero entries in each column. Then

$$\mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| \le C \log^3(2p) \Big( \sqrt{\frac{m}{n}} + \frac{m}{n} \Big) \|\Sigma\|.$$

Proof. We note that $\|M\|_{1,2} \le \sqrt{m}$ and $\|M\| \le m$, and apply
Theorem 2.1. □

Remark 1 (Sample size). Corollary 2.2 implies that for every
$\varepsilon \in (0, 1)$, the sample size

$$n \ge 4 C^2 \varepsilon^{-2} m \log^6(2p) \quad \text{suffices for} \quad \mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| \le \varepsilon \|\Sigma\|. \tag{2.2}$$

For sparse matrices $M$ with $m \ll p$, this makes partial estimation
possible with $n \ll p$ observations. Therefore we regard (2.2) as a
satisfactory "sparse" version of the classical bound (1.3). A logarithmic
term necessarily has to appear in (2.2); see Remark 3 below.

Remark 2 (Compressed sensing). Results of a similar nature arise in
compressed sensing (see e.g. [sl). There one tries to reconstruct a signal
$x \in \mathbb{R}^p$ from its $n$ linear measurements
$y = Ax \in \mathbb{R}^n$. In the range $n \ge p$, where the problem is
overdetermined, the reconstruction is achieved by inverting the $n \times p$
matrix $A$ (assuming it has full rank). In the range $n \ll p$, the problem is
underdetermined, and a reconstruction is only possible under certain
structural assumptions on the signal $x$ such as sparsity. If $x$ has $m$
non-zero coordinates, then the typical results of compressed sensing
guarantee an exact and algorithmically effective reconstruction provided that
$n \gtrsim m \log^2 p$.
Our result (2.2) is of a similar nature: $n \gtrsim m \log^6 p$ observations
suffice to estimate an $m$-sparse part of a covariance matrix in dimension
$p$. A significant difference is that the compressed sensing problem becomes
easy if one knows the locations of the non-zero coordinates of the signal
$x$. In contrast, the covariance estimation problem remains non-trivial even
if one knows the location of the $m$-sparse part of the covariance matrix
(which is the situation in Corollary 2.2).

Remark 3 (Logarithmic factor). In Corollary 2.2 note the mild, logarithmic
dependence on the dimension $p$ and the optimal dependence on the sparsity
$m$, which is of the same type as in (1.4). The necessity of the logarithmic
term in our results is evident from the example where $\Sigma = M = I$.
Indeed, writing in coordinates
$X_k = (X_{k1}, \ldots, X_{kp}) \in \mathbb{R}^p$, we obtain

$$\mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| = \mathbb{E} \max_{i \le p} \Big| \frac{1}{n} \sum_{k=1}^{n} X_{ki}^2 - 1 \Big| \gtrsim \sqrt{\frac{\log p}{n}}.$$

We may compare this with the conclusion of Theorem 2.1 and Corollary 2.2
for this example, which is

$$\mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| \le C \log^3(2p) \Big( \frac{1}{\sqrt{n}} + \frac{1}{n} \Big).$$

We see that the logarithmic terms in these results are unavoidable. However,
the present exponent 3 of the logarithm is certainly not optimal, and it can
probably be reduced to the optimal value 1/2.

Remark 4 (Centering). If we center the data with the sample mean
$\bar{X} = n^{-1} \sum_{k=1}^{n} X_k$ rather than the true mean, as one would
in practice, we have

$$\mathbb{E}\|M \cdot (\hat\Sigma_n - \bar{X}\bar{X}^T) - M \cdot \Sigma\| \le \mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| + \mathbb{E}\|M \cdot (\bar{X}\bar{X}^T)\|.$$

Then, denoting by $\|\cdot\|_\infty$ the $\ell_\infty$ norm in
$\mathbb{R}^p$, we can check that

$$\mathbb{E}\|M \cdot (\bar{X}\bar{X}^T)\| \le \mathbb{E}\,\|M\| \|\bar{X}\|_\infty^2 \le C \|M\| \|\Sigma\| \frac{\log(2p)}{n},$$

so this term can be absorbed into the second term in (2.1).

Remark 5 (Identifying the non-zero entries of $\Sigma$ by thresholding). For
thresholding covariance matrices, our result is not directly applicable since
the matrix $M$ is random and estimated from data. However, we note that if
the assumption is made that all non-zero entries in $\Sigma$ are bounded away
from zero by a margin of $h > 0$, then a sample size of
$n \gtrsim h^{-2} \log(2p)$ would assure that all their locations are
estimated correctly with probability approaching 1. With this assumption, we
could derive a bound for the thresholded estimator. Since our focus here is
not on any one particular estimator, we do not pursue this extension,
particularly since there are already bounds available specifically for
thresholding ^, 0].
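The example $\Sigma = M = I$ of Remark 3 is easy to simulate: the error is the largest deviation from 1 among the diagonal entries of the sample covariance matrix. The Monte Carlo sketch below assumes NumPy; the sample size, dimensions, number of repetitions and function name are arbitrary illustrative choices. It compares the simulated error with $\sqrt{\log p / n}$.

```python
import numpy as np

def mean_diag_error(n, p, reps, rng):
    """Monte Carlo estimate of E max_i |(1/n) sum_k X_ki^2 - 1| for X_k ~ N(0, I_p)."""
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        diag = (X ** 2).mean(axis=0)       # diagonal of the sample covariance
        errs.append(np.abs(diag - 1.0).max())
    return float(np.mean(errs))

rng = np.random.default_rng(4)
n = 200
results = {p: mean_diag_error(n, p, reps=20, rng=rng) for p in (10, 100, 1000)}

# Ratio of the simulated error to sqrt(log p / n); under the heuristic of
# Remark 3 this ratio should stay of constant order as p grows.
ratios = {p: results[p] / np.sqrt(np.log(p) / n) for p in results}
```

The error grows with $p$ even though every diagonal entry is estimated from the same $n$ observations, which is the source of the logarithmic factor.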

Remark 6 (Previous results). Results similar to Corollary 2.2 have been
known for banding covariance matrices. They apply to the mask matrices $M$
with $m$ diagonals consisting of 1's (while all other entries are zero). A
result of jsj guarantees in this case an error bound of order
$m \sqrt{\log p / n}$. While this bound has a slightly better dependence on
$p$, it implies that the sample size $n$ required for estimation grows as
$m^2$, whereas our result shows $n$ growing linearly with $m$. Furthermore,
for some tapered version of the $m$-diagonal matrix $M$, an error of order
$(m + \log p)/n$ was obtained in (sj, which agrees with ours up to $\log p$.
The latter result heavily exploits the structure of the tapered banded matrix
$M$, which allows one to decompose it into a weighted sum of small minors and
reduce the problem to using (1.4) for each minor. In contrast, our result
does not require any structure of the mask matrix $M$.
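The different growth in $m$ of the two sample-size requirements can be tabulated directly. In the sketch below, the function names are hypothetical, the banding requirement is obtained by solving $m \sqrt{\log p / n} \le \varepsilon$ for $n$ with the implicit constant set to 1, and the absolute constant $C$ in (2.2) is unknown, so `C=1.0` is only a placeholder. The comparison is therefore meaningful for the growth in $m$ only, not for the absolute sizes.

```python
import math

def n_banding(m, p, eps):
    """Smallest n with m * sqrt(log(p) / n) <= eps (implicit constant taken as 1)."""
    return math.ceil(m ** 2 * math.log(p) / eps ** 2)

def n_partial(m, p, eps, C=1.0):
    """Sample size from (2.2): n >= 4 C^2 eps^-2 m log^6(2p).
    C is an unspecified absolute constant; C=1.0 is a placeholder assumption."""
    return math.ceil(4 * C ** 2 * m * math.log(2 * p) ** 6 / eps ** 2)

p, eps = 1000, 0.5
growth_banding = n_banding(20, p, eps) / n_banding(10, p, eps)   # doubling m
growth_partial = n_partial(20, p, eps) / n_partial(10, p, eps)   # doubling m
```

Doubling $m$ roughly quadruples the first requirement and roughly doubles the second.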

Acknowledgement. The authors are grateful to the anonymous referee for 
careful reading of the manuscript and useful suggestions. 

3. Preliminaries 

3.1. Proof outline. In the rest of this paper we prove Theorem 2.1. We shall
observe that the quadratic form $\langle (M \cdot \hat\Sigma_n) x, y \rangle$
is a Gaussian chaos (defined in (3.2) below) for fixed unit vectors $x, y$ on
the sphere $S^{p-1}$. The main difficulty is to control the chaos uniformly
for all $x, y$. This will be done by establishing concentration inequalities
of varying power depending on the "sparsity" of $x, y$ (which amounts to
decoupling, conditioning, and Gaussian concentration), and combining this
with covering arguments to "count" the number of sparse vectors $x, y$ on the
sphere.

3.2. Decoupling. The following observation does not seem to be found in 
literature on decoupling (such as [^). 

Proposition 3.1 (Decoupling of sample covariance matrices). Let $X$ be a
centered random normal vector in $\mathbb{R}^p$ with covariance matrix
$\Sigma$, and let $X_1, \ldots, X_n$ and $X'_1, \ldots, X'_n$ be independent
copies of $X$. Consider the sample covariance matrix and the decoupled sample
covariance matrix of $X$ defined as

$$\hat\Sigma_n = \frac{1}{n} \sum_{k=1}^{n} X_k \otimes X_k, \qquad \hat\Sigma'_n = \frac{1}{n} \sum_{k=1}^{n} X'_k \otimes X_k. \tag{3.1}$$

Then for every symmetric $p \times p$ matrix $M$ we have

$$\mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| \le 2\,\mathbb{E}\|M \cdot \hat\Sigma'_n\|.$$

Proof. Computing the operator norm as the maximal value of the associated
quadratic form, we have

$$\mathbb{E}\|M \cdot \hat\Sigma_n - M \cdot \Sigma\| = \mathbb{E}\|M \cdot \hat\Sigma_n - \mathbb{E}(M \cdot \hat\Sigma_n)\| = \mathbb{E} \sup_{a \in S^{p-1}} \big| \langle (M \cdot \hat\Sigma_n) a, a \rangle - \mathbb{E} \langle (M \cdot \hat\Sigma_n) a, a \rangle \big|.$$

Writing the inner product in coordinates

$$\langle (M \cdot \hat\Sigma_n) a, a \rangle = \frac{1}{n} \sum_{k=1}^{n} \sum_{i,j=1}^{p} m_{ij} a_i a_j X_{ki} X_{kj}$$

we recognize in the right hand side a quadratic Gaussian chaos. Recall that
generally, a quadratic Gaussian chaos in dimension $p$ is a quadratic form

$$\sum_{i,j=1}^{p} a_{ij} Z_i Z_j = \langle AZ, Z \rangle \tag{3.2}$$

where $Z = (Z_1, \ldots, Z_p)$ is a centered normal random vector in
$\mathbb{R}^p$, and the coefficients $A = (a_{ij})_{i,j=1}^{p}$ are taken from
some subset $\mathcal{A}$ of $p \times p$ matrices. The conclusion of the
proposition then follows from the next lemma. □

Lemma 3.2 (Decoupling of a Gaussian chaos). Let $Z$ be a centered normal
random vector in $\mathbb{R}^d$, and let $Z'$ be an independent copy of $Z$.
Let $\mathcal{A}$ be a subset of symmetric $d \times d$ matrices. Then

$$\mathbb{E} \sup_{A \in \mathcal{A}} \big| \langle AZ, Z \rangle - \mathbb{E} \langle AZ, Z \rangle \big| \le 2\,\mathbb{E} \sup_{A \in \mathcal{A}} \big| \langle AZ, Z' \rangle \big|.$$

Lemma 3.2 allows one to replace a Gaussian chaos $\sum_{i,j} a_{ij} Z_i Z_j$
by its decoupled version $\sum_{i,j} a_{ij} Z_i Z'_j$. While it can be
derived from observations in [13, Section 3.2], it is simpler to give a
complete proof.

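Proof. Without loss of generality, we may assume that $Z$ is a standard
normal random vector. Indeed, there exists a symmetric $d \times d$ matrix
$T$ such that $Z = Tg$ and $Z' = Tg'$ for some independent standard Gaussian
vectors $g, g'$ in $\mathbb{R}^d$. Thus
$\langle AZ, Z \rangle = \langle TATg, g \rangle$ and
$\langle AZ, Z' \rangle = \langle TATg, g' \rangle$. Replacing $A$ by $TAT$
in the statement of the lemma, we see that we can assume that $Z$ is a
standard normal random vector.

Using the identical distribution of $Z$ and $Z'$ and Jensen's inequality, we
obtain

$$E := \mathbb{E} \sup_{A \in \mathcal{A}} \big| \langle AZ, Z \rangle - \mathbb{E} \langle AZ, Z \rangle \big| = \mathbb{E} \sup_{A \in \mathcal{A}} \big| \langle AZ, Z \rangle - \mathbb{E} \langle AZ', Z' \rangle \big| \le \mathbb{E} \sup_{A \in \mathcal{A}} \big| \langle AZ, Z \rangle - \langle AZ', Z' \rangle \big|.$$

Note the identity
$\langle AZ, Z \rangle - \langle AZ', Z' \rangle = 2 \big\langle A \big( \tfrac{Z + Z'}{\sqrt{2}} \big), \tfrac{Z - Z'}{\sqrt{2}} \big\rangle$.
By rotation invariance of Gaussian measure, the pair of vectors
$\big( \tfrac{Z + Z'}{\sqrt{2}}, \tfrac{Z - Z'}{\sqrt{2}} \big)$ is
distributed identically with $(Z, Z')$. Hence we conclude that

$$E \le 2\,\mathbb{E} \sup_{A \in \mathcal{A}} \big| \langle AZ, Z' \rangle \big|.$$

This completes the proof. □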
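Both ingredients of this proof can be checked numerically: the algebraic identity for symmetric $A$, and the fact that the map $(Z, Z') \mapsto ((Z + Z')/\sqrt{2}, (Z - Z')/\sqrt{2})$ is an orthogonal transformation of $\mathbb{R}^{2d}$, which is what rotation invariance requires. The sketch assumes NumPy; the dimension and the random test matrix are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 6
B = rng.standard_normal((d, d))
A = (B + B.T) / 2                          # an arbitrary symmetric d x d matrix
Z = rng.standard_normal(d)
Z_prime = rng.standard_normal(d)           # independent copy of Z

U = (Z + Z_prime) / np.sqrt(2)
V = (Z - Z_prime) / np.sqrt(2)

lhs = Z @ A @ Z - Z_prime @ A @ Z_prime    # <AZ, Z> - <AZ', Z'>
rhs = 2 * (A @ U) @ V                      # 2 <A U, V>

# The change of variables (Z, Z') -> (U, V) written as a 2d x 2d matrix.
I = np.eye(d)
Q = np.block([[I, I], [I, -I]]) / np.sqrt(2)
is_orthogonal = np.allclose(Q @ Q.T, np.eye(2 * d))
```

Symmetry of $A$ is essential here: the cross terms $\langle AZ', Z \rangle$ and $\langle AZ, Z' \rangle$ cancel only because they are equal.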

Remark 7. It is not clear if a version of Decoupling Proposition 3.1 holds
for general distributions. The argument we gave relied heavily on the
rotation invariance of the normal distribution.

3.3. Concentration. The following observation is a form of the rotation
invariance property of normal distributions.

Lemma 3.3. Let $Z = (Z_1, \ldots, Z_d)$ be a centered normal random vector
in $\mathbb{R}^d$ with covariance matrix $\Sigma$. Let
$a = (a_1, \ldots, a_d) \in \mathbb{R}^d$. Then $\sum_{i=1}^{d} a_i Z_i$ is a
centered normal random variable with standard deviation
$\|\Sigma^{1/2} a\|_2 \le \|\Sigma\|^{1/2} \|a\|_2$. □
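The standard deviation in Lemma 3.3 is $(a^T \Sigma a)^{1/2}$, which can be compared with both expressions in the lemma. The sketch below assumes NumPy and an arbitrary randomly generated positive-semidefinite $\Sigma$; the matrix square root is formed from an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
B = rng.standard_normal((d, d))
Sigma = B @ B.T                            # an arbitrary positive-semidefinite matrix
a = rng.standard_normal(d)

# Symmetric square root of Sigma via its eigendecomposition.
w, V = np.linalg.eigh(Sigma)
Sigma_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

std_exact = float(np.sqrt(a @ Sigma @ a))                 # sqrt(Var(sum_i a_i Z_i))
std_via_root = float(np.linalg.norm(Sigma_half @ a))      # ||Sigma^{1/2} a||_2
upper = float(np.sqrt(np.linalg.norm(Sigma, 2)) * np.linalg.norm(a))
```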



Our proof of Theorem 2.1 will use the concentration of measure in the Gauss
space. Such concentration results are usually stated in the literature for
the standard normal distribution, but it is straightforward to make an
adjustment for general normal distributions.

Proposition 3.4 (Concentration in the Gauss space, see Section 1.1).
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a Lipschitz function, and denote
$L = \|f\|_{\mathrm{Lip}}$. Let $Z$ be a centered normal vector in
$\mathbb{R}^d$ with covariance matrix $\Sigma$. Then

$$\mathbb{P}\{ f(Z) - \mathbb{E} f(Z) > t \} \le \frac{1}{2} \exp\Big( -\frac{t^2}{2 L^2 \|\Sigma\|} \Big) \quad \text{for all } t > 0.$$

Remark 8. If $f$ is a norm on $\mathbb{R}^d$, then by translation
invariance, $\|f\|_{\mathrm{Lip}}$ equals the minimal number $L$ such that

$$f(x) \le L \|x\|_2 \quad \text{for all } x \in \mathbb{R}^d.$$

3.4. Discretization. The operator norm of a $p \times p$ matrix $A$ can be
computed via the associated bilinear form as

$$\|A\| = \max_{x, y \in S^{p-1}} \langle Ax, y \rangle.$$

By an approximation argument, one can replace the sphere $S^{p-1}$ by an
$\varepsilon$-net:

Lemma 3.5 (Computing operator norm on a net, see [18]). Let $A$ be a
$p \times p$ matrix. Let $\mathcal{N}$ be a $\delta$-net of $S^{p-1}$ in the
Euclidean metric for some $\delta \in [0, 1)$. Then

$$\|A\| \le (1 - \delta)^{-2} \max_{x, y \in \mathcal{N}} \langle Ax, y \rangle.$$
We will construct a net $\mathcal{N}$ by rounding off the coefficients of a
vector $x$.

Definition 3.6 (Regular vectors). Define the subset of regular vectors of the
sphere $S^{p-1}$ as follows:

$$\mathrm{Reg}_p(s) = \{ x \in S^{p-1} : \text{all coordinates satisfy } x_i^2 \in \{0, 1/s\} \}, \quad s \in [p];$$

$$\mathrm{Reg}_p = \bigcup_{s \in [p]} \mathrm{Reg}_p(s).$$

Note that $|\mathrm{supp}(x)| = s$ for all $x \in \mathrm{Reg}_p(s)$.

Lemma 3.7 (Computing operator norm on regular vectors). Let $A$ be a
$p \times p$ matrix. Then

$$\|A\| \le 12 \lceil \ln(2p) \rceil^2 \max_{x, y \in \mathrm{Reg}_p} \langle Ax, y \rangle.$$

Proof. Define $k_0 = \lceil \ln(2p) \rceil$ and let $\mathcal{N}$ consist of
all vectors $x \in S^{p-1}$ such that for each coordinate $x_i$ there is
$k \in \{0, 1, \ldots, k_0\}$ such that $|x_i| = 2^{-k}$. It is easy to check
that $\mathcal{N}$ is a 0.71-net of $S^{p-1}$.

By considering the level sets of a vector $x \in \mathcal{N}$, one can
represent $x = \lambda_1 x^{(1)} + \cdots + \lambda_{k_0} x^{(k_0)}$ where all
$\lambda_k \in [0, 1]$ and $x^{(k)}$ are regular vectors. Using this
representation for $x$ and $y$ in Lemma 3.5 with $\delta = 0.71$, we conclude
by the linearity of the inner product that the estimate in the lemma holds. □
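The level-set representation used in this proof can be made explicit. The sketch below assumes NumPy and a hypothetical helper name: it splits a unit vector whose nonzero coordinates are of the form $\pm 2^{-k}$ into one regular vector per level, with coefficient $\lambda = 2^{-k} \sqrt{s}$, where $s$ is the size of the level set. The example vector is chosen by hand so that it has unit norm.

```python
import numpy as np

def regular_decomposition(x):
    """Split x into level sets {i : |x_i| = level}.
    Returns a list of (lam, v) with v a regular vector and x = sum of lam * v."""
    parts = []
    for level in np.unique(np.abs(x[x != 0])):
        support = np.abs(x) == level
        s = int(support.sum())
        # Entries of v are 0 off the level set and +-1/sqrt(s) on it.
        v = np.where(support, np.sign(x), 0.0) / np.sqrt(s)
        lam = float(level * np.sqrt(s))
        parts.append((lam, v))
    return parts

# Three coordinates of size 1/2 and four of size 1/4: squared norm 3/4 + 1/4 = 1.
x = np.array([0.5, -0.5, 0.25, 0.25, -0.25, 0.5, 0.25])
parts = regular_decomposition(x)
reconstructed = sum(lam * v for lam, v in parts)
```

Each coefficient satisfies $\lambda \le \|x\|_2 \le 1$, since $\lambda^2$ is the squared norm of the part of $x$ supported on one level set.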

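Lemma 3.8 (Computing operator norm: symmetric distributions). Let $A$ be a
random $p \times p$ matrix such that $A$ and $A^T$ are identically
distributed. Then, for every $t > 0$, we have

$$\mathbb{P}\{ \|A\| > t \} \le 2\,\mathbb{P}\Big\{ 12 \lceil \ln(2p) \rceil^2 \max_{\substack{r, s \in [p] \\ r \le s}} \; \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle Ax, y \rangle > t \Big\}.$$

Proof. We can write the conclusion of Lemma 3.7 as

$$\|A\| \le 12 \lceil \ln(2p) \rceil^2 \max_{r, s \in [p]} \; \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle Ax, y \rangle.$$

Taking the maximum separately over $r \le s$ and $r \ge s$ and replacing the
maximum by the sum, we have

$$\|A\| \le 12 \lceil \ln(2p) \rceil^2 \max \Big( \max_{\substack{r, s \in [p] \\ r \le s}} \; \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle Ax, y \rangle, \; \max_{\substack{r, s \in [p] \\ s \le r}} \; \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle Ax, y \rangle \Big). \tag{3.3}$$

By the symmetry assumption on the distribution of $A$, the random variables
$\langle Ax, y \rangle$ and $\langle Ay, x \rangle$ are identically
distributed for all $x, y$. Therefore, the two double maxima in (3.3) are
identically distributed. This yields the conclusion of the lemma. □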






4. Proof of Theorem 2.1



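4.1. Decoupling and conditioning. By rescaling, we can assume without loss
of generality that $\|\Sigma\| = 1$. We shall denote the entries of $M$ by
$m_{ij}$.

By Proposition 3.1, it suffices to estimate
$\mathbb{E}\|M \cdot \hat\Sigma'_n\|$, where $\hat\Sigma'_n$ denotes the
decoupled sample covariance matrix defined in (3.1). By the definition of
$\hat\Sigma'_n$ and the symmetry of $M$, the random matrices
$M \cdot \hat\Sigma'_n$ and $(M \cdot \hat\Sigma'_n)^T$ are identically
distributed. Therefore, Lemma 3.8 applies and we get for every $t > 0$ that

$$\mathbb{P}\{ \|M \cdot \hat\Sigma'_n\| > t \} \le 2\,\mathbb{P}\Big\{ 12 \lceil \ln(2p) \rceil^2 \max_{\substack{r, s \in [p] \\ r \le s}} \; \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle (M \cdot \hat\Sigma'_n) x, y \rangle > t \Big\}. \tag{4.1}$$

Writing the inner product in coordinates and rearranging the terms, we have

$$\langle (M \cdot \hat\Sigma'_n) x, y \rangle = \frac{1}{n} \sum_{k=1}^{n} \sum_{j=1}^{p} \Big( \sum_{i=1}^{p} m_{ij} x_i X_{ki} \Big) y_j X'_{kj}. \tag{4.2}$$

Let us fix $x$ and $y$ and condition on the random vectors
$(X_1, \ldots, X_n)$. Then the expression in (4.2) defines a centered normal
random variable. We shall estimate its standard deviation by Lemma 3.3.
Since the covariance matrix $\Sigma$ of each vector
$X'_k \in \mathbb{R}^p$ has operator norm at most 1, the covariance matrix of
the concatenated vector $(X'_k)_{k=1}^{n} \in \mathbb{R}^{np}$ also has
operator norm at most 1. Then Lemma 3.3 yields that the expression in (4.2)
is a centered normal random variable with standard deviation at most
$\sigma_x(X_1, \ldots, X_n) \|y\|_\infty$, where

$$\sigma_x(X_1, \ldots, X_n) := \frac{1}{n} \Big( \sum_{k=1}^{n} \sum_{j=1}^{p} \Big( \sum_{i=1}^{p} m_{ij} x_i X_{ki} \Big)^2 \Big)^{1/2}.$$

We will need to bound this quantity uniformly for all $x$.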
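4.2. Concentration. Let $x \in \mathrm{Reg}_p(r)$. We will estimate
$\sigma_x(X_1, \ldots, X_n)$ using concentration in Gauss space,
Proposition 3.4. First, we have

$$\mathbb{E}\,\sigma_x(X_1, \ldots, X_n) \le \big( \mathbb{E}\,\sigma_x(X_1, \ldots, X_n)^2 \big)^{1/2} = \frac{1}{n} \Big( \sum_{k=1}^{n} \sum_{j=1}^{p} \mathbb{E} \Big( \sum_{i=1}^{p} m_{ij} x_i X_{ki} \Big)^2 \Big)^{1/2}$$

$$\le \frac{1}{n} \Big( \sum_{k=1}^{n} \sum_{i=1}^{p} \sum_{j=1}^{p} m_{ij}^2 x_i^2 \Big)^{1/2} \quad \text{(by Lemma 3.3 and } \|\Sigma\| = 1\text{)}$$

$$\le \frac{1}{\sqrt{n}} \max_{i \in [p]} \Big( \sum_{j=1}^{p} m_{ij}^2 \Big)^{1/2} \quad \text{(because } \|x\|_2 = 1\text{)}$$

$$\le \frac{\|M\|_{1,2}}{\sqrt{n}} \quad \text{(by definition).} \tag{4.3}$$

Next, we consider $\sigma_x : \mathbb{R}^{np} \to \mathbb{R}$ as a function of
the concatenated Gaussian vector
$(X_1, \ldots, X_n) \in \mathbb{R}^{np}$. Computing the Lipschitz norm of
$\sigma_x$ becomes easy once we write this function as

$$\sigma_x(X_1, \ldots, X_n) = \frac{1}{n} \Big( \sum_{k=1}^{n} \|M (x \cdot X_k)\|_2^2 \Big)^{1/2}$$

where as usual $x \cdot X_k$ denotes the Hadamard (coordinate-wise) product
of vectors, and the multiplication by $M$ is the ordinary (matrix)
multiplication. Separating $M$ and $x$, we obtain the bound

$$\sigma_x(X_1, \ldots, X_n) \le \frac{1}{n} \|M\| \|x\|_\infty \Big( \sum_{k=1}^{n} \|X_k\|_2^2 \Big)^{1/2}.$$

Using that $\|x\|_\infty = 1/\sqrt{r}$ since $x \in \mathrm{Reg}_p(r)$, we
conclude that

$$\sigma_x(X_1, \ldots, X_n) \le \frac{\|M\|}{\sqrt{r}\, n} \cdot \|(X_1, \ldots, X_n)\|_2.$$

By the remark below Proposition 3.4, we have proved that

$$\|\sigma_x\|_{\mathrm{Lip}} \le \frac{\|M\|}{\sqrt{r}\, n}. \tag{4.4}$$

In addition, since the covariance matrix $\Sigma$ of each vector
$X_k \in \mathbb{R}^p$ has operator norm at most 1, the covariance matrix of
the concatenated vector $(X_1, \ldots, X_n) \in \mathbb{R}^{np}$ also has
operator norm at most 1. From this and the bounds on the expectation (4.3)
and on the Lipschitz norm (4.4), we conclude by Proposition 3.4 that for all
$x \in \mathrm{Reg}_p(r)$ and $t > 0$,

$$\mathbb{P}\Big\{ \sigma_x(X_1, \ldots, X_n) > \frac{\|M\|_{1,2}}{\sqrt{n}} + t \Big\} \le \frac{1}{2} \exp\Big( -\frac{t^2 r n^2}{2 \|M\|^2} \Big). \tag{4.5}$$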
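The two expressions for $\sigma_x$ used in this subsection, and the bound that gives the Lipschitz constant in (4.4), can be compared on random inputs. The sketch assumes NumPy; the matrix $M$, the sample and the regular vector $x$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, r = 20, 30, 4

B = rng.standard_normal((p, p))
M = (B + B.T) / 2                          # an arbitrary symmetric matrix
X = rng.standard_normal((n, p))            # row k is the vector X_k

# A regular vector in Reg_p(r): r nonzero entries equal to +-1/sqrt(r).
x = np.zeros(p)
x[:r] = np.array([1.0, -1.0, 1.0, 1.0]) / np.sqrt(r)

# Coordinate form: (1/n) * (sum_k sum_j (sum_i m_ij x_i X_ki)^2)^(1/2).
inner = np.einsum('ij,i,ki->kj', M, x, X)
sigma_coord = float(np.sqrt((inner ** 2).sum()) / n)

# Matrix form: (1/n) * (sum_k ||M (x . X_k)||_2^2)^(1/2).
sigma_mat = float(np.sqrt(sum(np.linalg.norm(M @ (x * Xk)) ** 2 for Xk in X)) / n)

# Bound ||M|| / (sqrt(r) n) * ||(X_1, ..., X_n)||_2 behind (4.4).
lipschitz_bound = float(np.linalg.norm(M, 2) * np.linalg.norm(X) / (np.sqrt(r) * n))
```

Here `np.linalg.norm(X)` is the Frobenius norm of the data matrix, which equals the Euclidean norm of the concatenated vector $(X_1, \ldots, X_n)$.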
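4.3. Union bounds. We return to estimating the random variable
$\langle (M \cdot \hat\Sigma'_n) x, y \rangle$ which we initiated in
Section 4.1. Let us fix $u \ge 1$. For each $x \in \mathrm{Reg}_p(r)$, we
consider the events

$$\mathcal{E}_x := \Big\{ \sigma_x(X_1, \ldots, X_n) \le \frac{\|M\|_{1,2}}{\sqrt{n}} + \frac{\|M\|}{n}\, u \sqrt{10 \ln(2ep)} \Big\}.$$

By (4.5), we have

$$\mathbb{P}(\mathcal{E}_x) \ge 1 - \frac{1}{2} \exp\big( -5 u^2 r \ln(2ep) \big). \tag{4.6}$$

Note that the function $\sigma_x(X_1, \ldots, X_n)$ and thus also the events
$\mathcal{E}_x$ are independent of the random variables
$(X'_1, \ldots, X'_n)$.

Let $x \in \mathrm{Reg}_p(r)$ and $y \in \mathrm{Reg}_p(s)$. As we noted in
Section 4.1, conditioned on a realization of random variables
$(X_1, \ldots, X_n)$ satisfying $\mathcal{E}_x$, the random variable
$\langle (M \cdot \hat\Sigma'_n) x, y \rangle$ is distributed identically
with a centered normal random variable $h$ whose standard deviation is
bounded by

$$\sigma_x(X_1, \ldots, X_n) \|y\|_\infty \le \frac{\|M\|_{1,2}}{\sqrt{sn}} + \frac{\|M\|}{\sqrt{s}\, n}\, u \sqrt{10 \ln(2ep)} =: \sigma.$$

Then by the usual tail estimate for Gaussian random variables, we have

$$\mathbb{P}\{ \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \mid \mathcal{E}_x \} \le \frac{1}{2} \exp\big( -\varepsilon^2 / 2\sigma^2 \big) \quad \text{for } \varepsilon > 0.$$

Choosing

$$\varepsilon = \varepsilon(u) := 2\sqrt{3}\, u\, \frac{\|M\|_{1,2}}{\sqrt{n}} \sqrt{\ln(2ep)} + 2\sqrt{30}\, u^2\, \frac{\|M\|}{n} \ln(2ep), \tag{4.7}$$

we obtain

$$\mathbb{P}\{ \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \mid \mathcal{E}_x \} \le \frac{1}{2} \exp\big( -6 u^2 s \ln(2ep) \big)$$

for all $x \in \mathrm{Reg}_p(r)$, $y \in \mathrm{Reg}_p(s)$. We would like to
take the union bound in this estimate over all $y \in \mathrm{Reg}_p(s)$.
Note that

$$|\mathrm{Reg}_p(s)| = \binom{p}{s} 2^s \le \exp\big( s \ln(2ep/s) \big)$$

as there are $\binom{p}{s}$ ways to choose the support and $2^s$ ways to
choose the signs of the coefficients of a vector in $\mathrm{Reg}_p(s)$. Then

$$\mathbb{P}\Big\{ \max_{y \in \mathrm{Reg}_p(s)} \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \;\Big|\; \mathcal{E}_x \Big\} \le \frac{1}{2} \exp\big( s \ln(2ep/s) \big) \exp\big( -6 u^2 s \ln(2ep) \big) \le \frac{1}{2} \exp\big( -5 u^2 s \ln(2ep) \big)$$

as $u \ge 1$. Therefore, using (4.6), we have

$$\mathbb{P}\Big\{ \max_{y \in \mathrm{Reg}_p(s)} \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \Big\} \le \mathbb{P}\Big\{ \max_{y \in \mathrm{Reg}_p(s)} \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \;\Big|\; \mathcal{E}_x \Big\} + \mathbb{P}(\mathcal{E}_x^c)$$

$$\le \frac{1}{2} \exp\big( -5 u^2 s \ln(2ep) \big) + \frac{1}{2} \exp\big( -5 u^2 r \ln(2ep) \big). \tag{4.8}$$

Now we take a further union bound over $x \in \mathrm{Reg}_p(r)$ for fixed
$r, s \in [p]$, $r \le s$. In this range, the second term in (4.8) dominates.
Estimating as before $|\mathrm{Reg}_p(r)| \le \exp(r \ln(2ep/r))$, we obtain
that

$$\mathbb{P}\Big\{ \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \Big\} \le \exp\big( r \ln(2ep/r) \big) \exp\big( -5 u^2 r \ln(2ep) \big) \le \exp\big( -4 u^2 r \ln(2ep) \big).$$

Finally, we take the union bound over all allowed pairs $r, s$. Since
$r \ge 1$ and $u \ge 1$, we have

$$\mathbb{P}\Big\{ \max_{\substack{r, s \in [p] \\ r \le s}} \; \max_{\substack{x \in \mathrm{Reg}_p(r) \\ y \in \mathrm{Reg}_p(s)}} \langle (M \cdot \hat\Sigma'_n) x, y \rangle > \varepsilon \Big\} \le p^2 \exp\big( -4 u^2 \ln(2ep) \big) \le \exp\big( -2 u^2 \ln(2ep) \big).$$

Using (4.1), we have shown that

$$\mathbb{P}\big\{ \|M \cdot \hat\Sigma'_n\| > 12 \lceil \ln(2p) \rceil^2 \varepsilon \big\} \le 2 \exp\big( -2 u^2 \ln(2ep) \big)$$

for all $u \ge 1$ and for $\varepsilon = \varepsilon(u)$ defined in (4.7).
Integration yields

$$\mathbb{E}\|M \cdot \hat\Sigma'_n\| \le 84\, \frac{\|M\|_{1,2}}{\sqrt{n}} [\ln(2ep)]^{5/2} + 263\, \frac{\|M\|}{n} [\ln(2ep)]^{3}. \tag{4.9}$$

Decoupling Proposition 3.1 completes the proof of Theorem 2.1, giving also
an explicit bound on the absolute constant $C$ and a slightly better
dependence on $p$. □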
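The counting estimate $|\mathrm{Reg}_p(s)| = \binom{p}{s} 2^s \le \exp(s \ln(2ep/s))$ that drives the union bounds of Section 4.3 can be verified directly. The sketch below uses only the Python standard library; the function names are hypothetical, and the brute-force enumeration is feasible only for very small $p$.

```python
import itertools
import math

def count_regular(p, s):
    """Number of regular vectors with support size s: choose the support, then the signs."""
    return math.comb(p, s) * 2 ** s

def log_count_bound(p, s):
    """Logarithm of the upper bound exp(s * ln(2 e p / s))."""
    return s * math.log(2 * math.e * p / s)

def brute_force_count(p, s):
    """Count sign patterns in {-1, 0, 1}^p with exactly s nonzero entries."""
    return sum(
        1
        for pattern in itertools.product((-1, 0, 1), repeat=p)
        if sum(abs(c) for c in pattern) == s
    )

p = 50
bound_holds = all(
    math.log(count_regular(p, s)) <= log_count_bound(p, s) for s in range(1, p + 1)
)
small_case = (brute_force_count(4, 2), count_regular(4, 2))
```

Comparing logarithms avoids forming very large floating-point numbers when $p$ and $s$ are both large.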

Remark 9 (Arbitrary distributions). It is likely that the results of this
paper generalize from Gaussian to arbitrary distributions in $\mathbb{R}^p$
with enough moments. However, it is not clear whether a version of Decoupling
Proposition 3.1 holds for general distributions.